{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Getting Started - Accelerate Your Scripts with nvFuser\n\n**Authors**: [Christian Sarofeen](https://github.com/csarofeen)\n[Piotr Bialecki](https://github.com/ptrblck)\n[Kevin Stephano](https://github.com/kevinstephano)\n[Jie Jiang](https://github.com/jjsjann123)\n[Masaki Kozuki](https://github.com/crcrpar)\n`Neal Vaidya`\n\n\n## Introduction\n\nThis tutorial demonstrates how you can accelerate your networks\nwith nvFuser. nvFuser is a Deep Learning Compiler that just-in-time\ncompiles fast and flexible GPU-specific code to reliably accelerate\nusers' networks automatically. It speeds up deep learning networks\nrunning on Volta and later CUDA accelerators by generating fast\ncustom \u201cfusion\u201d kernels at runtime. nvFuser is specifically\ndesigned to meet the unique requirements of the PyTorch community,\nand it supports diverse network architectures and programs with\ndynamic inputs of varying shapes and strides.\n\n## Importing Packages and Selecting a Device\nTo run this tutorial and see the benefits of using nvFuser, you need\nPyTorch `1.12.0` and `functorch` `0.2`, or newer versions of both.\n`functorch` also needs `networkx` for its smart recomputation\nheuristics, which you can install via `pip install networkx`.\nAdditionally, a GPU is required.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torch.nn.functional as F\nimport functorch\nfrom functorch.compile import memory_efficient_fusion\nfrom copy import deepcopy\nfrom typing import List\nimport time\nimport functools\nimport random\n\nrandom.seed(42)\n\nif torch.__version__ < (1, 12, 0):\n    raise RuntimeError(\n        \"PyTorch >= 1.12.0 required, but your environment uses torch=={}\".format(\n            torch.__version__\n        )\n    )\n\n# Only the major and minor components are needed; take the first two\n# fields so version strings with extra dot-separated parts still parse.\nmajor, minor = functorch.__version__.split(\".\")[:2]\nif int(major) == 0 and int(minor) < 2:\n    raise RuntimeError(\n        \"FuncTorch >= 0.2.0 required, but your environment uses functorch=={}\".format(\n            functorch.__version__\n        )\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Transformer Block\nThe network topology we\u2019re going to focus on is the Transformer\nBlock for networks like BERT. At the time of writing, nvFuser\nprovides acceleration of pointwise, reduction, and normalization\noperations. These simple operations are the backbone of large\nnetworks, so improving the speed of these operations can improve\noverall network training speed. Future releases of nvFuser will\nimprove the performance of Linear Layers, but for now we will\nspecifically look at the Bias-Dropout-Add-LayerNorm section of this\nTransformer Block.\n\n.. figure:: /_static/img/nvfuser_intro/nvfuser_transformer_block.png\n\nFirst, let\u2019s define the forward pass for this section of our network.\nBecause we will later use TorchScript on this function, we annotate\nthe function parameters with type information. This isn\u2019t\nalways required, but it can often help to provide this information to\nTorchScript because it is a strictly typed system. Since we have\nPyTorch\u2019s autograd system, we don\u2019t need to explicitly define the\nbackward pass.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def composite_definition(\n    input1: torch.Tensor,\n    input2: torch.Tensor,\n    weight: torch.Tensor,\n    bias1: torch.Tensor,\n    bias2: torch.Tensor,\n    normalization_axis: int,\n    dropout_prob: float,\n) -> torch.Tensor:\n    bias1_out = input1 + bias1\n    dropout_out = F.dropout(bias1_out, dropout_prob, training=True)\n    norm_input = dropout_out + input2\n    norm_output = F.layer_norm(\n        norm_input, (input1.size(normalization_axis),), weight, bias2\n    )\n    return norm_output"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Setup and Performance Metrics\nNext, we initialize some inputs, parameters, and a simulated gradient\noutput tensor for the backwards pass since we aren\u2019t including a\nloss function.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Setup initial tensors and parameters\ninput_size = [64, 128, 1024]\ndevice = \"cuda\"\ndtype = torch.float32\n\n# Create sample inputs\ninput1 = torch.randn(*input_size, device=device, dtype=dtype, requires_grad=True)\ninput2 = torch.rand_like(input1).requires_grad_()\n\n# Precompute a grad output tensor, for this example it's the same size\n# as the inputs\ngrad_output = torch.rand_like(input1)\n\n# Randomly initialize the model parameters\nweight = torch.nn.Parameter(torch.randn(input_size[2], dtype=dtype, device=device))\nbias1 = torch.nn.Parameter(torch.randn(input_size[2], dtype=dtype, device=device))\nbias2 = torch.nn.Parameter(torch.randn(input_size[2], dtype=dtype, device=device))\n\nparameters = [input1, input2, weight, bias1, bias2]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To produce a baseline performance we will measure the speed of our\nforward and backward passes in PyTorch\u2019s default eager mode. To get\naccurate and comparable measurements, we perform a few warm-up\niterations. Then, we time many iterations of the forward and backward\npass using performance counters combined with proper GPU\nsynchronization, and compute the average iterations per second.\nIt\u2019s important to be very careful when measuring performance on the\nGPU, as we want to remove any initialization costs and need\nsynchronization since it\u2019s an asynchronous device. Since we will\nmeasure many variations of this problem with and without nvFuser, we\ndefine a helper function called `profile_workload` and will use\n`functools.partial` to concisely profile the workload.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Utility to profile the workload\ndef profile_workload(forward_func, grad_output, iteration_count=100, label=\"\"):\n    # Perform warm-up iterations\n    for _ in range(3):\n        # Run model, forward and backward\n        output = forward_func()\n        output.backward(grad_output)\n        # Delete the gradients to avoid profiling the gradient accumulation\n        for p in parameters:\n            p.grad = None\n\n    # Synchronize the GPU before starting the timer\n    torch.cuda.synchronize()\n    start = time.perf_counter()\n    for _ in range(iteration_count):\n        # Run model, forward and backward\n        output = forward_func()\n        output.backward(grad_output)\n        # Delete the gradients to avoid profiling the gradient accumulation\n        for p in parameters:\n            p.grad = None\n\n    # Synchronize the GPU before stopping the timer\n    torch.cuda.synchronize()\n    stop = time.perf_counter()\n    iters_per_second = iteration_count / (stop - start)\n    if label:\n        print(label)\n    print(\"Average iterations per second: {:.2f}\".format(iters_per_second))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now measure a baseline performance of PyTorch\u2019s eager mode\n(without nvFuser).\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Run and profile eager mode execution on the composite definition of our\n# operations.\nfunc = functools.partial(\n    composite_definition,\n    input1,\n    input2,\n    weight,\n    bias1,\n    bias2,\n    normalization_axis=2,\n    dropout_prob=0.1,\n)\nprofile_workload(\n    func, grad_output, iteration_count=100, label=\"Eager Mode - Composite definition\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "It\u2019s important for PyTorch and nvFuser to work well across diverse\nGPU architectures. For our measurements we\u2019ve run this tutorial on\nfive GPUs ranging from consumer to enterprise grade. Our baseline\ngeometric mean (geomean) performance across these GPUs is 850\niterations per second, plotted in the figure below.\n\n.. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_0.png\n\nAs we run different variations of this script with nvFuser, we will\ncontinue to add the results to this figure for the same GPUs.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## TorchScript & nvFuser\nnvFuser is the default fusion system in TorchScript since PyTorch\nversion 1.12, so to turn on nvFuser we need to enable TorchScript.\nThis will allow nvFuser to automatically generate fast kernels and\ntake over execution of these operations. TorchScript can be a\nchallenging system to get working, but with our current definition\nof our operators, all we need to do is wrap our function in the\n`torch.jit.script` compile function. We can then simply run our\nworkload as before.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "scripted_composite_definition = torch.jit.script(composite_definition)\nfunc = functools.partial(\n    scripted_composite_definition,\n    input1,\n    input2,\n    weight,\n    bias1,\n    bias2,\n    normalization_axis=2,\n    dropout_prob=0.1,\n)\nprofile_workload(\n    func, grad_output, iteration_count=100, label=\"TorchScript - Composite definition\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before we get to the results, it is important to mention here that\nnvFuser does not generate the exact same sequence of random numbers,\nas random number generation in PyTorch is dependent on the precise\nparallelization scheme used for the GPU function. Therefore, to\nvalidate the output of nvFuser against the output without nvFuser,\nyou need to disable the random number generation\nfunctions. In this example, we would simply need to change\n`dropout_out = F.dropout(bias1_out, dropout_prob, training=True)`\nto\n`dropout_out = F.dropout(bias1_out, dropout_prob, training=False)`\nas the dropout function is the only function in this example that\ndepends on random number generation.\n\n.. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_1.png\n\nOur geomean performance with nvFuser is 1,394 iterations per second,\nwhich is a geomean of 1.64x faster than eager mode. We did not\ninclude the time that TorchScript and nvFuser take to compile the\nprogram and GPU functions. For real end-to-end training the\ncompile time of TorchScript and nvFuser is negligible. For\nexample, in this tutorial the combination of TorchScript and\nnvFuser took around 2.4s in total to compile these high speed\nGPU functions.\n\nnvFuser\u2019s capabilities extend well beyond this initial performance gain.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## nvFuser & Dynamic Shapes\nIt is challenging for Deep Learning Compilers to provide performance\ngains when the user changes the input sizes of the tensors. However,\nsupporting changing shapes has always been a fundamental design\ncriterion for nvFuser, as processing different-sized input tensors is\ncritical to many applications like Natural Language Processing and\nGraph Neural Networks.\n\nTo use nvFuser on inputs that change shape from iteration to\niteration, we generate new input and output gradient tensors in a few\ndifferent sizes. Since the last dimension is shared with the\nparameters and cannot be changed dynamically in LayerNorm, we\nperturb the first two dimensions of the input and gradient tensors.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "SHAPE_COUNT = 20\ndynamic_sizes = deepcopy(input_size)\n\ninputs1: List[torch.Tensor] = []\ninputs2: List[torch.Tensor] = []\ngrad_outputs: List[torch.Tensor] = []\n\n\n# Create some random shapes\nfor _ in range(SHAPE_COUNT):\n    dynamic_sizes[0] = input_size[0] + random.randrange(-2, 3)\n    dynamic_sizes[1] = input_size[1] + random.randrange(-2, 3)\n    input = torch.randn(*dynamic_sizes, device=device, dtype=dtype, requires_grad=True)\n    inputs1.append(input)\n    inputs2.append(torch.rand_like(input))\n    grad_outputs.append(torch.rand_like(input))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "No changes from before are required for running with TorchScript;\nwe simply reuse the previous definition that we wrapped in\n`torch.jit.script`.\n\nWe\u2019ll start as usual by performing some warm-up iterations. However,\nwe won\u2019t show nvFuser all of the input sizes; we\u2019ll only show one\nsize for the warm-up.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Perform warm-up iterations\nfor _ in range(3):\n    dynamic_input1 = inputs1[0]\n    dynamic_input2 = inputs2[0]\n    dynamic_grad_output = grad_outputs[0]\n    # Run model, forward and backward\n    output = scripted_composite_definition(\n        dynamic_input1,\n        dynamic_input2,\n        weight,\n        bias1,\n        bias2,\n        normalization_axis=2,\n        dropout_prob=0.1,\n    )\n    output.backward(dynamic_grad_output)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, we can measure the performance metrics of nvFuser as we have\npreviously.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Profile manually as our helper function expects static inputs\niteration_count = 100\n# Synchronize the GPU before starting the timer\ntorch.cuda.synchronize()\nstart = time.perf_counter()\nfor i in range(iteration_count):\n    dynamic_input1 = inputs1[i % SHAPE_COUNT]\n    dynamic_input2 = inputs2[i % SHAPE_COUNT]\n    dynamic_grad_output = grad_outputs[i % SHAPE_COUNT]\n    dynamic_parameters = [dynamic_input1, dynamic_input2, weight, bias1, bias2]\n\n    # Run model, forward and backward\n    output = scripted_composite_definition(\n        dynamic_input1,\n        dynamic_input2,\n        weight,\n        bias1,\n        bias2,\n        normalization_axis=2,\n        dropout_prob=0.1,\n    )\n    output.backward(dynamic_grad_output)\n    # Delete the gradients to avoid profiling the gradient accumulation\n    for p in dynamic_parameters:\n        p.grad = None\n\n# Synchronize the GPU before stopping the timer\ntorch.cuda.synchronize()\nstop = time.perf_counter()\niters_per_second = iteration_count / (stop - start)\nprint(\"TorchScript - Random Sizes\")\nprint(\"Average iterations per second: {:.2f}\".format(iters_per_second))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Performance across our GPUs is very similar to the previous\nperformance seen. Only the performance of the A100 degraded\nslightly, but is still much higher than without nvFuser. The small\nchange in performance of the A100 is actually related to the\nadditional CPU overhead that dynamic shapes cause in nvFuser.\nnvFuser at runtime has to infer how to run the different sized\nkernels, so additional CPU time is consumed. This CPU time is\npresent with all GPUs, but since the A100 runs its functions so fast\nthis CPU overhead cannot be fully hidden by the asynchronous nature\nof GPU execution.\n\n.. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_2.png\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>Today, nvFuser in TorchScript is the only exposure of\n          nvFuser that allows for dynamic shape changes, although we will\n          expand this capability to other systems in the future. For more\n          insight into how dynamic shapes are implemented in nvFuser, you can\n          view this presentation from GTC 2021:\n          https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31952/</p></div>\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Defining novel operations with nvFuser and FuncTorch\n\nOne of the primary benefits of nvFuser is the ability to define\nnovel operations composed of PyTorch \u201cprimitives\u201d which are then\njust-in-time compiled into efficient kernels.\n\nPyTorch has strong performance for any individual operation,\nespecially composite operations like LayerNorm. However, if\nLayerNorm wasn\u2019t already implemented in PyTorch as a composite\noperation, then you\u2019d have to define it as a series of simpler\n(primitive) operations. Let\u2019s make such a definition and run it\nwithout nvFuser.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def primitive_definition(\n    input1: torch.Tensor,\n    input2: torch.Tensor,\n    weight: torch.Tensor,\n    bias1: torch.Tensor,\n    bias2: torch.Tensor,\n    normalization_axis: int,\n    dropout_prob: float,\n    keepdim: bool,\n) -> torch.Tensor:\n    bias1_out = input1 + bias1\n    dropout_out = F.dropout(bias1_out, dropout_prob, training=True)\n    norm_input = dropout_out + input2\n    mean = norm_input.mean(normalization_axis, keepdim=keepdim)\n    diff = norm_input - mean\n    diff_sq = diff * diff\n    var = diff_sq.mean(normalization_axis, keepdim=keepdim)\n    pre_shift_scale_norm_output = (norm_input - mean) / torch.sqrt(var + 1e-12)\n    norm_output = weight * pre_shift_scale_norm_output + bias2\n    return norm_output\n\n\n# Profile primitive definition\nfunc = functools.partial(\n    primitive_definition,\n    input1,\n    input2,\n    weight,\n    bias1,\n    bias2,\n    normalization_axis=2,\n    dropout_prob=0.1,\n    keepdim=True,\n)\nprofile_workload(\n    func, grad_output, iteration_count=100, label=\"Eager Mode - Primitive Definition\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "While the above is mathematically equivalent to our previous\ndefinition, benchmarking our new function with the original static\nshape in eager mode (without TorchScript and nvFuser) shows that the\niterations per second decrease \u2013 mostly due to the cost of accessing\nmemory to save intermediate results.\n\n.. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_3.png\n\nThe geomean is 260 iterations per second,\n3.26x slower than the composite definition in eager mode and 5.35x\nslower than the nvFuser composite operation! For more information on\nwhy there\u2019s such a drastic decrease in compute speed please see this\npresentation from GTC 2022:\nhttps://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41958/\n\nnvFuser with TorchScript can improve the performance of this\noperation even though it\u2019s defined with primitive PyTorch\noperations. Simply by enabling TorchScript on the new function\n(just like before), we recover much of the performance.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Profile scripted primitive definition\nscripted_primitive_definition = torch.jit.script(primitive_definition)\nfunc = functools.partial(\n    scripted_primitive_definition,\n    input1,\n    input2,\n    weight,\n    bias1,\n    bias2,\n    normalization_axis=2,\n    dropout_prob=0.1,\n    keepdim=True,\n)\nprofile_workload(\n    func, grad_output, iteration_count=100, label=\"TorchScript - Primitive definition\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_4.png\n\nHowever, the performance is still slower than the original eager\nmode performance of the composite definition. TorchScript works well\nwhen predefined composite operations are used, but TorchScript\u2019s\napplication of Autograd saves all of the activations for each\noperator in the fusion for re-use in the backward pass. This is\nnot typically the optimal choice. Especially when chaining\ntogether multiple simple operations, it is often much faster to\nrecompute some intermediate tensors rather than spend the time\nstoring and retrieving several saved results from memory.\n\nIt\u2019s possible to optimize away many of these unnecessary memory\naccesses, but it requires building a connected forward and backward\ngraph, which isn\u2019t possible with TorchScript. The\n`memory_efficient_fusion` pass in FuncTorch, however, is such an\noptimization pass. To use this pass, we have to redefine our\nfunction to pull the constants inside (for now it\u2019s easiest to make\nnon-tensor constants literals in the function definition):\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def primitive_definition_for_memory_efficient_fusion(\n    input1: torch.Tensor,\n    input2: torch.Tensor,\n    weight: torch.Tensor,\n    bias1: torch.Tensor,\n    bias2: torch.Tensor,\n) -> torch.Tensor:\n    bias1_out = input1 + bias1\n    dropout_out = F.dropout(bias1_out, 0.1, training=True)\n    norm_input = dropout_out + input2\n    mean = norm_input.mean(2, keepdim=True)\n    diff = norm_input - mean\n    diff_sq = diff * diff\n    var = diff_sq.mean(2, keepdim=True)\n    pre_shift_scale_norm_output = (norm_input - mean) / torch.sqrt(var + 1e-12)\n    norm_output = weight * pre_shift_scale_norm_output + bias2\n    return norm_output"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, instead of passing our function to TorchScript, we will pass it\nto FuncTorch\u2019s optimization pass.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Optimize the model with FuncTorch tracing and the memory efficiency\n# optimization pass\nmemory_efficient_primitive_definition = memory_efficient_fusion(\n    primitive_definition_for_memory_efficient_fusion\n)\n\n# Profile memory efficient primitive definition\nfunc = functools.partial(\n    memory_efficient_primitive_definition, input1, input2, weight, bias1, bias2\n)\nprofile_workload(\n    func,\n    grad_output,\n    iteration_count=100,\n    label=\"FuncTorch - Primitive definition\",\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This recovers even more speed, but it\u2019s still not as fast as\nTorchScript\u2019s original performance with the composite definition.\nHowever, this is still faster than running this new definition\nwithout nvFuser, and is still faster than the composite definition\nwithout nvFuser.\n\n.. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_5.png\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>FuncTorch\u2019s memory efficient pass is experimental and still\n          actively in development.\n          Future versions of the API are expected to achieve performance\n          closer to that of TorchScript with the composite definition.</p></div>\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>FuncTorch\u2019s memory efficient pass specializes on the shapes of\n          the inputs to the function. If new inputs are provided with\n          different shapes, then you need to construct a new function\n          using `memory_efficient_fusion` and apply it to the new inputs.</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Transformer Block With a Novel Normalization\nThe ability to quickly execute chains of simple operations is\nimportant as not every operation has a composite operation defined\nin PyTorch. Previously, this meant researchers either had to define\nan entirely new operation in PyTorch \u2013 which takes a lot of time and\nknowledge of the lower-level PyTorch code as well as parallel\nprogramming \u2013 or write the operation in simpler PyTorch ops and\nsettle for poor performance. For example, let's replace LayerNorm\nin our example with RMSNorm. Even though RMSNorm is a bit simpler\nthan LayerNorm, it doesn\u2019t have an existing composite operation in\nPyTorch. See the [Root Mean Square Layer Normalization](https://doi.org/10.48550/arXiv.1910.07467) paper for more information about RMSNorm.\nAs before, we\u2019ll define our new transformer block with\nprimitive PyTorch operations.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def with_rms_norm(\n    input1: torch.Tensor,\n    input2: torch.Tensor,\n    weight: torch.Tensor,\n    bias: torch.Tensor,\n    normalization_axis: int,\n    dropout_prob: float,\n    keepdim: bool,\n) -> torch.Tensor:\n    bias_out = input1 + bias\n    dropout_out = F.dropout(bias_out, dropout_prob, training=True)\n    norm_input = dropout_out + input2\n    var = norm_input.mul(norm_input).mean(normalization_axis, keepdim)\n    pre_shift_scale_norm_output = norm_input / torch.sqrt(var + 1e-12)\n    norm_output = weight * pre_shift_scale_norm_output\n    return norm_output"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As before, we\u2019ll get a baseline by running PyTorch without nvFuser.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Profile rms_norm\nfunc = functools.partial(\n    with_rms_norm,\n    input1,\n    input2,\n    weight,\n    bias1,\n    normalization_axis=2,\n    dropout_prob=0.1,\n    keepdim=True,\n)\nprofile_workload(func, grad_output, iteration_count=100, label=\"Eager Mode - RMS Norm\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With nvFuser through TorchScript.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Profile scripted rms_norm\nscripted_with_rms_norm = torch.jit.script(with_rms_norm)\nfunc = functools.partial(\n    scripted_with_rms_norm,\n    input1,\n    input2,\n    weight,\n    bias1,\n    normalization_axis=2,\n    dropout_prob=0.1,\n    keepdim=True,\n)\nprofile_workload(func, grad_output, iteration_count=100, label=\"TorchScript - RMS Norm\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With nvFuser through FuncTorch.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def with_rms_norm_for_memory_efficient_fusion(\n    input1: torch.Tensor, input2: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor\n) -> torch.Tensor:\n    bias_out = input1 + bias\n    dropout_out = torch.nn.functional.dropout(bias_out, 0.1)\n    norm_input = dropout_out + input2\n    var = norm_input.mul(norm_input).mean(2, keepdim=True)\n    pre_shift_scale_norm_output = norm_input / torch.sqrt(var + 1e-12)\n    norm_output = weight * pre_shift_scale_norm_output\n    return norm_output\n\n\n# Profile memory efficient rms_norm\nmemory_efficient_rms_norm = memory_efficient_fusion(\n    with_rms_norm_for_memory_efficient_fusion\n)\nfunc = functools.partial(memory_efficient_rms_norm, input1, input2, weight, bias1)\nprofile_workload(func, grad_output, iteration_count=100, label=\"FuncTorch - RMS Norm\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. figure:: /_static/img/nvfuser_intro/nvfuser_tutorial_6.png\n\nSince RMSNorm is simpler than LayerNorm, the performance of our new\ntransformer block is a little higher than the primitive definition\nwithout nvFuser (354 iterations per second compared with 260\niterations per second). With nvFuser, the iterations per second\nincrease by 2.68x and 3.36x, to 952 iterations per second and 1,191\niterations per second with TorchScript and FuncTorch\u2019s memory\nefficient optimization pass, respectively. The performance of this\nnew operation nearly matches the performance of the composite\nLayerNorm definition with TorchScript.\n\nnvFuser is here to provide the ability to define novel operations in\nsimple PyTorch and get performance that\u2019s close to a highly optimized\ncomposite operation in PyTorch. We believe this will enable research\ninto novel network topologies without paying for sometimes devastating\neffects on speed of training. nvFuser provides this unique ability as\nit\u2019s able to analyze users\u2019 programs to provide performance as fast as a\nhighly hand-tuned implementation, regardless of how the operations are\ndefined. nvFuser still cannot support every operation in PyTorch,\nbut its capabilities will continue to grow over time.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}